Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions
Despite the widespread success of Transformers on NLP tasks, recent works
have found that they struggle to model several formal languages when compared
to recurrent models. This raises the question of why Transformers perform well
in practice and whether they have any properties that enable them to generalize
better than recurrent models. In this work, we conduct an extensive empirical
study on Boolean functions to demonstrate the following: (i) Randomly
initialized Transformers are relatively more biased towards functions of low
sensitivity.
(ii) When trained on Boolean functions, both Transformers and LSTMs prioritize
learning functions of low sensitivity, with Transformers ultimately converging
to functions of lower sensitivity. (iii) On sparse Boolean functions, which
have low sensitivity, we find that Transformers generalize near-perfectly even
in the presence of noisy labels, whereas LSTMs overfit and achieve poor
generalization accuracy. Overall, our results provide strong quantifiable
evidence suggesting differences in the inductive biases of Transformers and
recurrent models, which may help explain Transformers' effective
generalization performance despite their relatively limited expressiveness.
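
For intuition about the central quantity: the sensitivity of a Boolean
function at an input is the number of coordinates whose flip changes the
output, and the average sensitivity is the mean of this count over all inputs.
Sparse functions that depend on few bits have low sensitivity, while parity is
maximally sensitive. A minimal Python sketch (illustrative, not the paper's
code) that computes it exactly by enumeration:

    import itertools

    def avg_sensitivity(f, n):
        # Average over all 2^n inputs of the number of coordinates
        # whose flip changes f's output.
        total = 0
        for x in itertools.product((0, 1), repeat=n):
            fx = f(x)
            for i in range(n):
                flipped = x[:i] + (1 - x[i],) + x[i + 1:]
                total += fx != f(flipped)
        return total / 2 ** n

    parity = lambda x: sum(x) % 2        # maximally sensitive: every flip matters
    and3 = lambda x: int(all(x[:3]))     # sparse: depends on only 3 of n bits
    print(avg_sensitivity(parity, 6))    # 6.0
    print(avg_sensitivity(and3, 6))      # 0.75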
MAGNIFICo: Evaluating the In-Context Learning Ability of Large Language Models to Generalize to Novel Interpretations
Humans possess a remarkable ability to assign novel interpretations to
linguistic expressions, enabling them to learn new words and understand
community-specific connotations. However, Large Language Models (LLMs) have a
knowledge cutoff and are costly to finetune repeatedly. Therefore, it is
crucial for LLMs to learn novel interpretations in-context. In this paper, we
systematically analyse the ability of LLMs to acquire novel interpretations
using in-context learning. To facilitate our study, we introduce MAGNIFICo, an
evaluation suite implemented within a text-to-SQL semantic parsing framework
that incorporates diverse tokens and prompt settings to simulate real-world
complexity. Experimental results on MAGNIFICo demonstrate that LLMs exhibit a
surprisingly robust capacity for comprehending novel interpretations from
natural language descriptions as well as from discussions within long
conversations. Nevertheless, our findings also highlight the need for further
improvements, particularly when interpreting unfamiliar words or when composing
multiple novel interpretations simultaneously in the same example.
Additionally, our analysis uncovers the semantic predispositions of LLMs and
reveals the impact of recency bias on information presented in long contexts.
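
To make the setup concrete, here is a hypothetical probe in the spirit of
MAGNIFICo; the nonce word 'glarb', the schema, and the phrasing are invented
for illustration and are not actual benchmark items. A novel interpretation is
defined in-context, and the model must apply it when generating SQL:

    # Hypothetical MAGNIFICo-style probe (invented example, not a benchmark
    # item): the nonce word "glarb" receives a novel interpretation in-context,
    # and the model must apply it while generating SQL.
    definition = ("In this domain, 'glarb' describes an employee whose salary "
                  "exceeds the average salary of their department.")
    prompt = f"""{definition}

    Schema: employees(id, name, salary, dept_id)

    Question: List the names of all glarb employees.
    SQL:"""

    # A correct completion composes the novel interpretation, e.g.:
    # SELECT name FROM employees e
    # WHERE salary > (SELECT AVG(salary) FROM employees
    #                 WHERE dept_id = e.dept_id);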
DynaQuant: Compressing Deep Learning Training Checkpoints via Dynamic Quantization
With the increase in the scale of Deep Learning (DL) training workloads in
terms of compute resources and time consumption, the likelihood of encountering
in-training failures rises substantially, leading to lost work and resource
wastage. Such failures are typically mitigated by a checkpointing mechanism,
which comes at the cost of storage and network bandwidth overhead.
State-of-the-art approaches involve lossy model compression mechanisms, which
induce a tradeoff between the resulting model quality (accuracy) and the
compression ratio. Delta compression is then used to further reduce the
overhead by storing only the difference between consecutive checkpoints. We
make a key enabling observation
that the sensitivity of model weights to compression varies during training,
and different weights benefit from different quantization levels (ranging from
retaining full precision to pruning). We propose (1) a non-uniform quantization
scheme that leverages this variation, (2) an efficient search mechanism that
dynamically finds the best quantization configurations, and (3) a
quantization-aware delta compression mechanism that rearranges weights to
minimize checkpoint differences, thereby maximizing compression. We instantiate
these contributions in DynaQuant - a framework for DL workload checkpoint
compression. Our experiments show that DynaQuant consistently achieves a better
tradeoff between accuracy and compression ratios compared to prior works,
enabling a compression ratio up to 39x and withstanding up to 10 restores with
negligible accuracy impact for fault-tolerant training. DynaQuant achieves at
least an order of magnitude reduction in checkpoint storage overhead for
training failure recovery as well as transfer learning use cases without any
loss of accuracy.
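
A minimal Python sketch of the two core ideas, sensitivity-dependent
non-uniform quantization and quantization-aware delta compression between
consecutive checkpoints; the function names, bit widths, and thresholds below
are illustrative assumptions, not DynaQuant's actual implementation:

    import numpy as np

    def quantize(w, bits):
        # Uniform per-tensor quantization; bits=0 prunes the tensor,
        # bits>=32 retains full precision.
        if bits == 0:
            return np.zeros_like(w)
        if bits >= 32:
            return w
        scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
        if scale == 0:
            return w
        return np.round(w / scale) * scale

    def compress_step(ckpt, prev_q, sensitivity):
        # Illustrative policy: more compression-sensitive tensors keep
        # more precision (thresholds here are hypothetical).
        quantized, deltas = {}, {}
        for name, w in ckpt.items():
            s = sensitivity[name]
            bits = 32 if s > 0.9 else (8 if s > 0.1 else 0)
            q = quantize(w, bits)
            quantized[name] = q
            # Delta against the previous quantized checkpoint: weights that
            # did not change become exact zeros, which a generic compressor
            # shrinks aggressively.
            deltas[name] = q - prev_q[name] if prev_q else q
        return quantized, deltas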